System and method for audio classification based on unsupervised attribute learning

ABSTRACT

Described is an audio classification system for classifying audio signals. In operation, the system extracts salient patches from an intensity spectrogram of an audio signal. Thereafter, multi-scale global average pooling (GAP) features are extracted for all salient patches. The GAP features are clustered, with each cluster becoming a key attribute. A test audio signal can then be mapped onto a histogram of key attributes. Based on the histogram, the test audio signal can then be classified as a sound class, allowing for operation of a device based on the classification of the sound class.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of and is a non-provisional patent application of U.S. Provisional Application No. 62/581,625, filed on Nov. 3, 2017, the entirety of which is hereby incorporated by reference.

BACKGROUND OF INVENTION

(1) Field of Invention

The present invention relates to an audio classification system and, more specifically, to a system for classifying audio signals based on unsupervised attribute learning.

(2) Description of Related Art

Audio classifiers are designed to classify an input audio signal. Such systems are often implemented in speech recognition systems where the system classifies the audio signal as a particular word. However, in autonomous vehicular systems, a need exists to classify random audio signals as particular objects, object interactions, or potential obstacles, not just particular words. Due to the randomness of the presented audio signal, current audio classifiers are easily fooled by noise and the mixing of different sounds. Further, the false alarms generated by existing audio classifiers are not explainable.

In a somewhat related art, machine vision methods have been implemented to probe each unit of a convolutional neural network (CNN) to obtain image regions with the highest activations per unit (see Zhou et al. and Gonzalez-Garcia et al. in the List of Incorporated Literature References, Literature Reference Nos. 6 and 7). Machine vision methods are related because audio signals can be converted into spectrograms, which are essentially images. The above methods, however, suffer from several major disadvantages. For example, these methods require a human in the feedback loop to identify a common theme or concept that exists between top-scoring regions. Further, existing methods focus on analyzing neurons with the highest activations and neglect the neural activation patterns over the entire network for object classification.

Thus, a continuing need exists for a robust and foolproof audio classification system that implements an attribute-oriented sound classifier and that is operable in domains such as autonomous driving and rotorcraft operation.

SUMMARY OF INVENTION

This disclosure provides an audio classification system for classifying audio signals. In various embodiments, the system includes one or more processors and a memory, the memory being a non-transitory computer-readable medium having executable instructions encoded thereon, such that upon execution of the instructions, the one or more processors perform several operations. In operation, the system extracts salient patches from an intensity spectrogram of an audio signal. Thereafter, neural-network feature vectors are extracted for all salient patches. The feature vectors are then clustered, with each cluster becoming a key attribute. The process of extracting salient patches and extracting the feature vectors for the salient patches can be repeated for many audio signals in the training data, whereas the clustering is performed on the features for the whole training data set. A test audio signal can then be mapped onto a histogram of key attributes. Based on the histogram, the test audio signal can then be classified as a sound class, allowing for operation of a device based on the classification of the sound class.

In another aspect, the salient patches are extracted based on a neural network's activation for each spectrogram pixel or group of pixels.

In yet another aspect, the neural-network feature vectors are multi-scale global average pooling (GAP) features that are multi-scale feature vectors computed based on activations of at least two layers in a neural network.

Additionally, classifying the test audio signal includes inputting feature vectors from the neural network in response to an intensity spectrogram to be classified.

In yet another aspect, the device is an autonomous vehicle, such that the autonomous vehicle performs a physical maneuver operation based on the classification of the sound class.

Additionally, in clustering the GAP features, the GAP features are clustered using iterative unsupervised learning.

Finally, the present invention also includes a computer program product and a computer implemented method. The computer program product includes computer-readable instructions stored on a non-transitory computer-readable medium that are executable by a computer having one or more processors, such that upon execution of the instructions, the one or more processors perform the operations listed herein. Alternatively, the computer implemented method includes an act of causing a computer to execute such instructions and perform the resulting operations.

BRIEF DESCRIPTION OF THE DRAWINGS

The objects, features and advantages of the present invention will be apparent from the following detailed descriptions of the various aspects of the invention in conjunction with reference to the following drawings, where:

FIG. 1 is a block diagram depicting the components of a system according to various embodiments of the present invention;

FIG. 2 is an illustration of a computer program product embodying an aspect of the present invention;

FIG. 3 is a flow chart illustrating process flow of a system according to various embodiments of the present invention;

FIG. 4 is a flow chart illustrating a process according to various embodiments of the present invention where salient patches are extracted from an input;

FIG. 5 is an illustration depicting extraction of multi-scale global average pooling (GAP) features from spectrogram patches according to various embodiments of the present invention;

FIG. 6 is an illustration of a system architecture according to various embodiments of the present invention;

FIG. 7 is an illustration depicting samples of sound attribute groups and corresponding classifications; and

FIG. 8 is a block diagram depicting control of a device according to various embodiments.

DETAILED DESCRIPTION

The present invention relates to an audio classification system and, more specifically, to a system for classifying audio signals based on unsupervised attribute learning. The following description is presented to enable one of ordinary skill in the art to make and use the invention and to incorporate it in the context of particular applications. Various modifications, as well as a variety of uses in different applications, will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to a wide range of aspects. Thus, the present invention is not intended to be limited to the aspects presented, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

In the following detailed description, numerous specific details are set forth in order to provide a more thorough understanding of the present invention. However, it will be apparent to one skilled in the art that the present invention may be practiced without necessarily being limited to these specific details. In other instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present invention.

The reader's attention is directed to all papers and documents which are filed concurrently with this specification and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference. All the features disclosed in this specification (including any accompanying claims, abstract, and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features.

Furthermore, any element in a claim that does not explicitly state “means for” performing a specified function, or “step for” performing a specific function, is not to be interpreted as a “means” or “step” clause as specified in 35 U.S.C. Section 112, Paragraph 6. In particular, the use of “step of” or “act of” in the claims herein is not intended to invoke the provisions of 35 U.S.C. 112, Paragraph 6.

Before describing the invention in detail, first a list of cited references is provided. Next, a description of the various principal aspects of the present invention is provided. Subsequently, an introduction provides the reader with a general understanding of the present invention. Finally, specific details of various embodiments of the present invention are provided to give an understanding of the specific aspects.

(1) List of Incorporated Literature References

The following references are cited throughout this application. For clarity and convenience, the references are listed herein as a central resource for the reader. The following references are hereby incorporated by reference as though fully set forth herein. The references are cited in the application by referring to the corresponding literature reference number, as follows:

1. Sotiras, Aristeidis, Susan M. Resnick, and Christos Davatzikos. “Finding imaging patterns of structural covariance via non-negative matrix factorization.” NeuroImage 108 (2015): pages 1-16.
2. Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014).
3. T. Lindeberg. “Scale-space theory in computer vision”, volume 256. Springer Science & Business Media, 2013. Chapter 7, pages 165-170.
4. [UrbanSound8K], found at https://serv.cusp.nyu.edu/projects/urbansounddataset/urbansound8k.html
5. Xie, Junyuan, Ross Girshick, and Ali Farhadi. “Unsupervised Deep Embedding for Clustering Analysis.” arXiv preprint arXiv:1511.06335 (2015).
6. B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. “Object detectors emerge in deep scene CNNs.” arXiv preprint arXiv:1412.6856, 2014.
7. A. Gonzalez-Garcia, D. Modolo, and V. Ferrari. “Do semantic parts emerge in convolutional neural networks?” arXiv preprint arXiv:1607.03738, 2016.

(2) Principal Aspects

Various embodiments of the invention include three “principal” aspects. The first is a system for audio classification. The system is typically in the form of a computer system operating software or in the form of a “hard-coded” instruction set. This system may be incorporated into a wide variety of devices that provide different functionalities. As a non-limiting example, the system can be implemented within an autonomous vehicle, such as a drone or automobile, etc. The second principal aspect is a method, typically in the form of software, operated using a data processing system (computer). The third principal aspect is a computer program product. The computer program product generally represents computer-readable instructions stored on a non-transitory computer-readable medium such as an optical storage device, e.g., a compact disc (CD) or digital versatile disc (DVD), or a magnetic storage device such as a floppy disk or magnetic tape. Other non-limiting examples of computer-readable media include hard disks, read-only memory (ROM), and flash-type memories. These aspects will be described in more detail below.

A block diagram depicting an example of a system (i.e., computer system 100) of the present invention is provided in FIG. 1. The computer system 100 is configured to perform calculations, processes, operations, and/or functions associated with a program or algorithm. In one aspect, certain processes and steps discussed herein are realized as a series of instructions (e.g., software program) that reside within computer readable memory units and are executed by one or more processors of the computer system 100. When executed, the instructions cause the computer system 100 to perform specific actions and exhibit specific behavior, such as described herein.

The computer system 100 may include an address/data bus 102 that is configured to communicate information. Additionally, one or more data processing units, such as a processor 104 (or processors), are coupled with the address/data bus 102. The processor 104 is configured to process information and instructions. In an aspect, the processor 104 is a microprocessor. Alternatively, the processor 104 may be a different type of processor such as a parallel processor, application-specific integrated circuit (ASIC), programmable logic array (PLA), complex programmable logic device (CPLD), or a field programmable gate array (FPGA).

The computer system 100 is configured to utilize one or more data storage units. The computer system 100 may include a volatile memory unit 106 (e.g., random access memory (“RAM”), static RAM, dynamic RAM, etc.) coupled with the address/data bus 102, wherein a volatile memory unit 106 is configured to store information and instructions for the processor 104. The computer system 100 further may include a non-volatile memory unit 108 (e.g., read-only memory (“ROM”), programmable ROM (“PROM”), erasable programmable ROM (“EPROM”), electrically erasable programmable ROM (“EEPROM”), flash memory, etc.) coupled with the address/data bus 102, wherein the non-volatile memory unit 108 is configured to store static information and instructions for the processor 104. Alternatively, the computer system 100 may execute instructions retrieved from an online data storage unit such as in “Cloud” computing. In an aspect, the computer system 100 also may include one or more interfaces, such as an interface 110, coupled with the address/data bus 102. The one or more interfaces are configured to enable the computer system 100 to interface with other electronic devices and computer systems. The communication interfaces implemented by the one or more interfaces may include wireline (e.g., serial cables, modems, network adaptors, etc.) and/or wireless (e.g., wireless modems, wireless network adaptors, etc.) communication technology.

In one aspect, the computer system 100 may include an input device 112 coupled with the address/data bus 102, wherein the input device 112 is configured to communicate information and command selections to the processor 104. In accordance with one aspect, the input device 112 is an alphanumeric input device, such as a keyboard, that may include alphanumeric and/or function keys. Alternatively, the input device 112 may be an input device other than an alphanumeric input device. In an aspect, the computer system 100 may include a cursor control device 114 coupled with the address/data bus 102, wherein the cursor control device 114 is configured to communicate user input information and/or command selections to the processor 104. In an aspect, the cursor control device 114 is implemented using a device such as a mouse, a track-ball, a track-pad, an optical tracking device, or a touch screen. The foregoing notwithstanding, in an aspect, the cursor control device 114 is directed and/or activated via input from the input device 112, such as in response to the use of special keys and key sequence commands associated with the input device 112. In an alternative aspect, the cursor control device 114 is configured to be directed or guided by voice commands.

In an aspect, the computer system 100 further may include one or more optional computer usable data storage devices, such as a storage device 116, coupled with the address/data bus 102. The storage device 116 is configured to store information and/or computer executable instructions. In one aspect, the storage device 116 is a storage device such as a magnetic or optical disk drive (e.g., hard disk drive (“HDD”), floppy diskette, compact disk read only memory (“CD-ROM”), digital versatile disk (“DVD”)). Pursuant to one aspect, a display device 118 is coupled with the address/data bus 102, wherein the display device 118 is configured to display video and/or graphics. In an aspect, the display device 118 may include a cathode ray tube (“CRT”), liquid crystal display (“LCD”), field emission display (“FED”), plasma display, or any other display device suitable for displaying video and/or graphic images and alphanumeric characters recognizable to a user.

The computer system 100 presented herein is an example computing environment in accordance with an aspect. However, the non-limiting example of the computer system 100 is not strictly limited to being a computer system. For example, an aspect provides that the computer system 100 represents a type of data processing analysis that may be used in accordance with various aspects described herein. Moreover, other computing systems may also be implemented. Indeed, the spirit and scope of the present technology is not limited to any single data processing environment. Thus, in an aspect, one or more operations of various aspects of the present technology are controlled or implemented using computer-executable instructions, such as program modules, being executed by a computer. In one implementation, such program modules include routines, programs, objects, components and/or data structures that are configured to perform particular tasks or implement particular abstract data types. In addition, an aspect provides that one or more aspects of the present technology are implemented by utilizing one or more distributed computing environments, such as where tasks are performed by remote processing devices that are linked through a communications network, or such as where various program modules are located in both local and remote computer-storage media including memory-storage devices.

An illustrative diagram of a computer program product (i.e., storage device) embodying the present invention is depicted in FIG. 2. The computer program product is depicted as floppy disk 200 or an optical disk 202 such as a CD or DVD. However, as mentioned previously, the computer program product generally represents computer-readable instructions stored on any compatible non-transitory computer-readable medium. The term “instructions” as used with respect to this invention generally indicates a set of operations to be performed on a computer, and may represent pieces of a whole program or individual, separable software modules. Non-limiting examples of “instructions” include computer program code (source or object code) and “hard-coded” electronics (i.e., computer operations coded into a computer chip). The “instructions” are stored on any non-transitory computer-readable medium, such as in the memory of a computer or on a floppy disk, a CD-ROM, or a flash drive. In either event, the instructions are encoded on a non-transitory computer-readable medium.

(3) Introduction

This disclosure provides a system and method to improve the recognition performance of a deep-learning network by learning salient sound attributes in an unsupervised manner and using this information in parallel with the deep network for improved classification of audio data. For example, if one of the classes of sounds is “children playing”, this class may be correlated with the sound attribute “bird song”; thus, learning how to recognize bird song can help to identify children playing even though bird song was not explicitly labeled in the training data set. The system operates through a four-phase process, which allows for reliable classification of audio signals based on their attributes. In the first phase, the salient attributes of the input are extracted based on the activation patterns of a deep convolutional neural network (CNN). In the second phase, the salient attributes are fed through the CNN to extract the hierarchical responses of the network to individual salient attributes. In the third phase, an iterative unsupervised learning approach is applied to the network responses to identify the key attributes learned by the network. Finally, the input audio signal is summarized by a feature indicating the occurrence frequency of the key attributes. The feature summarization allows for classification and corresponding actions by a device in which the system is implemented (e.g., maneuvering a vehicle based on the audio classification).

Specifically, the system transforms audio signals in the time domain to their corresponding image-based representations as spectrograms. A CNN is trained to classify the audio data based on the spectrogram representation of the data. The system starts with the trained CNN and learns sound attributes that are encoded in distributed activation patterns of the network. Prior art methods often utilize the corresponding image/spectrogram regions with the highest activations of each unit of a CNN to find salient attributes. In contrast to such methods, the system of this disclosure models the pattern of activations in a group of CNN units, as opposed to single units, to find salient attributes. In addition, the system of the present disclosure combines the information extracted from key attributes with that of a conventional deep CNN to provide a significant boost in audio classification performance compared to the prior art. Further details are provided below.

(4) Specific Details of Various Embodiments

As noted above, the present disclosure provides an audio classification system. A key purpose of the system is to recognize salient attributes in spectrograms derived from audio signals. The audio signals may be recorded by one or more microphones and are converted from the time domain to the frequency domain using the Short-Time Fourier Transform (STFT). The spectrograms may be single-channel, in which case they carry only magnitude information, or multi-channel, in which case they carry additional information such as phase. These single- or multi-channel spectrograms are then processed to generate probabilities for a given set of sound categories, as described in the following. The category with the highest probability may identify the most prominent sound in the recording.
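
As a concrete illustration of this front end, the following Python sketch converts a time-domain signal into an intensity spectrogram with SciPy's STFT. It is a minimal sketch, not the disclosed implementation: the sample rate, window length, and log scaling are illustrative assumptions, as the disclosure does not fix particular STFT parameters.

```python
import numpy as np
from scipy.signal import stft

def audio_to_spectrogram(signal, sample_rate=22050, nperseg=1024):
    """Convert a time-domain audio signal into an intensity spectrogram.

    sample_rate and nperseg are illustrative assumptions; the disclosure
    leaves the STFT parameters open.
    """
    freqs, times, Z = stft(signal, fs=sample_rate, nperseg=nperseg)
    intensity = np.abs(Z)        # single-channel case: magnitude only
    return np.log1p(intensity)   # log scaling (a common, assumed choice)
```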

The system of this disclosure uses a convolutional neural network (CNN) to generate the related probabilities. See Literature Reference No. 2 in the List of Incorporated Literature References for a description of a CNN. The present invention improves upon traditional CNNs by using an unsupervised scheme for identifying the learned key attributes of a sound event. As shown in FIG. 3, the system includes a deep CNN 300 and receives an audio spectrogram 302 as input into the deep CNN 300. The key attributes are learned by first identifying the regions of the input spectrogram that are deemed salient by the network and then analyzing the network's activation patterns in these salient regions. As a result, a histogram of key attributes (Bag of Key Attributes 304) is obtained. These key attributes 304 are then used to improve the accuracy of category probabilities, which in turn could be used for decision making. The deep CNN 300 is used in three different ways: first, to filter the input data 302 to obtain salient audio segments through Salient Attribute Extraction 306 (see Phase 1 below); second, to convert salient input patches into feature vectors through an Extracting GAP Features 308 process (see Phase 2 below) (i.e., each patch has GAP features 500, and for all patches, there is a list of GAP features 500), which allows for Unsupervised Clustering 316 to generate the key attributes; and third, the output 310 of the CNN 300 is concatenated 312 with the Bag of Key Attributes 304 and mapped onto the classification probabilities using a classifier 314 (e.g., Softmax Classifier). A device 800 can then be caused to operate based on the classification. Thus, described below are Salient Attribute Extraction 306, GAP-Features Extraction 308, Unsupervised Clustering 316 of salient attributes, and Bag of Key-Attributes 304 extraction.

(4.1) Salient Attribute Extraction

The system starts by identifying salient regions of an input spectrogram 302. Given a pre-trained CNN 300 and an input spectrogram 302, elastic Nonnegative Matrix Factorization (NMF) is applied to the activation patterns (i.e., the last convolutional layer) of the CNN 300 to obtain and extract principal activation patterns 320 for the input data 302 (see Literature Reference No. 1 for a description of NMF). Note that since the fully connected layers of the CNN 300 are not used at this stage, the size of the input spectrogram could vary.

More precisely, let $X=[x_k]_{k=1}^{m}\in\mathbb{R}^{d\times m}$ denote the vectorized CNN 300 responses of the last convolutional layer (e.g., ‘conv5_4’ of VGG19), where m is the number of convolutional kernels at the last layer (e.g., m=512 in VGG19) and d is the number of nodes per convolutional kernel, which scales with the size of the input spectrogram (see Literature Reference No. 2 for further details regarding vectorized CNN responses). Thereafter, the NMF is formulated as

$$\underset{W,H}{\arg\min}\;\frac{1}{2}\lVert X-HW\rVert_F^2+\gamma\lambda\left(\lVert W\rVert_1+\lVert H\rVert_1\right)+\frac{1}{2}\gamma(1-\lambda)\left(\lVert W\rVert_F^2+\lVert H\rVert_F^2\right),$$

where $\lVert\cdot\rVert_F$ is the Frobenius norm, $\lVert\cdot\rVert_1$ is the elementwise $L_1$ norm, the columns of $H\in\mathbb{R}^{d\times r}$ are the non-negative components, $W\in\mathbb{R}^{r\times m}$ is the non-negative coefficient matrix, r is the rank of matrix H (which corresponds to the number of extracted components), and λ and γ are regularization parameters. A coordinate descent solver is used to find H and W.
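
For illustration, this elastic-net NMF can be approximated with scikit-learn's coordinate-descent solver. This is a sketch under stated assumptions: scikit-learn factorizes X ≈ W_sk H_sk, so W_sk plays the role of H (components) and H_sk the role of W (coefficients) in the formulation above; its penalty differs from the equation by internal scaling constants, and the rank r and regularization weights shown are illustrative.

```python
import numpy as np
from sklearn.decomposition import NMF

def extract_activation_components(X, r=8, gamma=0.1, lam=0.5):
    """Elastic-net NMF of vectorized CNN activations (sketch).

    X: (d, m) non-negative activation matrix from the last conv layer.
    scikit-learn factorizes X ~ W_sk @ H_sk, so W_sk (d x r) corresponds
    to the component matrix H above and H_sk (r x m) to the coefficient
    matrix W; r, gamma, and lam are illustrative values.
    """
    model = NMF(n_components=r, solver="cd",       # coordinate descent
                init="nndsvda", max_iter=500, random_state=0,
                alpha_W=gamma, alpha_H="same",     # overall weight (gamma)
                l1_ratio=lam)                      # L1/L2 mix (lambda)
    H = model.fit_transform(np.maximum(X, 0))      # (d, r) components
    W = model.components_                          # (r, m) coefficients
    return H, W
```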

After extracting the non-negative components (columns of H) and upsampling each component (i.e., resizing to the original image size to counter the effect of pooling layers), the system processes each component with a Laplacian-of-Gaussian blob-detector to extract regions of the input spectrogram 302 that are considered salient by the CNN 300 (see Literature Reference No. 3 for further details regarding the Laplacian-of-Gaussian blob-detector). Each component is an image and corresponds to a column of H before upsampling. Here, the salient regions are time intervals designated as salient patches 322 that are cut from an audio signal (e.g., capturing a bird song in a playground scene).

FIG. 4 further illustrates the Salient Attribute Extraction 306 process. In relation to FIG. 3, the process in FIG. 4 illustrates the flow from the audio spectrogram 302 through the deep CNN 300 and into the Salient Attribute Extraction box 306. The CNN 300 is used to learn the NMF components.

It should be noted that the input to the system is either a time-domain audio signal 402 or a spectrogram 302 after having been converted from the audio signal 402. If the input is an audio signal 402, the signal 402 is converted 404 to the spectrogram 302 using any suitable conversion technique or system as understood by those skilled in the art, a non-limiting example of which includes the Short-Time Fourier Transform. The spectrogram 302 is passed to the pre-trained CNN 300, where the non-negative matrix factorization (NMF) components are computed 406 from the last convolutional layer of the CNN 300. Referring again to FIG. 4, blob detection 408 is then performed on each NMF component to generate a collection of blobs 410 using a blob detector (e.g., the Laplacian-of-Gaussian blob-detector as described by Lindeberg; see Literature Reference No. 3). The blobs 410 are used to extract 412 salient patches 322 from the spectrogram 302. The system extracts 412 the salient patches 322 using the blobs 410 by putting a tight bounding box around each blob, which is then used as a boundary to cut out the patch from the spectrogram 302.
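
A minimal sketch of the blob-detection and patch-cutting steps follows, assuming scikit-image's Laplacian-of-Gaussian detector (`blob_log`). The threshold and maximum blob scale are illustrative, and the bounding box is derived from the approximate blob radius of sqrt(2) times the returned sigma.

```python
import numpy as np
from skimage.feature import blob_log
from skimage.transform import resize

def extract_salient_patches(spectrogram, components, threshold=0.1):
    """Cut salient patches from a spectrogram via LoG blob detection.

    components: list of 2-D NMF component maps (columns of H reshaped to
    the spatial layout of the last conv layer).  threshold and max_sigma
    are illustrative values.
    """
    patches = []
    H, W = spectrogram.shape
    for comp in components:
        comp = resize(comp, (H, W))                 # upsample to input size
        for r, c, sigma in blob_log(comp, max_sigma=30, threshold=threshold):
            rad = int(np.ceil(sigma * np.sqrt(2)))  # approximate blob radius
            r0, r1 = max(0, int(r) - rad), min(H, int(r) + rad + 1)
            c0, c1 = max(0, int(c) - rad), min(W, int(c) + rad + 1)
            patches.append(spectrogram[r0:r1, c0:c1])  # tight bounding box
    return patches
```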

(4.2) GAP-Features Extraction

In phase 2, the system probes the activation patterns of the CNN 300 at different layers and constructs a multi-scale feature for the extracted patches. As shown in FIG. 5, the system extracts multi-scale GAP features 500 from an input spectrogram salient patch 322 using a pre-trained CNN. This is done by performing global average pooling (GAP) at each layer of the network right before the ‘max pooling’, together with a normalization, and concatenating the outputs.

The feature captures the response energy of various convolutional kernels 502 at different layers and provides a succinct representation of the CNN. The normalization is needed so that the scale of average pooling at different layers is the same (i.e., the range is zero to one).
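
The multi-scale GAP feature can be sketched in PyTorch with forward hooks that capture the activation feeding each max-pooling layer, assuming a torchvision-style network (e.g., VGG19) whose convolutional stack is exposed as `cnn.features`; the per-layer max normalization is one assumed way to bring the ranges to [0, 1].

```python
import torch

def extract_gap_features(cnn, patch):
    """Multi-scale GAP feature for one salient patch (sketch).

    Assumes a torchvision-style CNN whose `features` attribute is a
    sequential conv/ReLU/pool stack (e.g., VGG19).  The activation feeding
    each max-pooling layer is globally average-pooled, normalized to
    [0, 1], and all per-layer vectors are concatenated.
    """
    captured = []

    def hook(module, inputs, output):
        gap = inputs[0].mean(dim=(2, 3)).squeeze(0)  # global average pool
        captured.append(gap / (gap.max() + 1e-8))    # normalize to [0, 1]

    handles = [m.register_forward_hook(hook)
               for m in cnn.features if isinstance(m, torch.nn.MaxPool2d)]
    with torch.no_grad():
        # Only the convolutional stack is used, so patch size may vary.
        cnn.features(patch.unsqueeze(0))             # patch: (C, H, W)
    for h in handles:
        h.remove()
    return torch.cat(captured)   # one multi-scale GAP feature vector
```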

(4.3) Unsupervised Clustering of Salient Attributes

In the third phase, having the salient patches 322 from all spectrograms in the dataset and their corresponding GAP features 500, the system utilizes an unsupervised learning framework to identify the key attributes recognized by the network. As shown in FIG. 3, the system utilizes iterative unsupervised deep embedding for clustering (DEC) 316, as described by Xie et al. (see Literature Reference No. 5), to cluster the salient extracted patches 322 into key attribute clusters. The clustering is done on the GAP features, where each feature corresponds to a salient patch 322. The idea behind DEC is to transform the data into a linear/nonlinear embedding space with a richer data representation and cluster the data in that space.
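
A simplified sketch of the DEC idea follows; it is not the full method of Literature Reference No. 5. A single linear layer stands in for DEC's autoencoder-initialized embedding, k-means provides the initial cluster centers, and the loop refines the embedding and centers by matching the Student's-t soft assignments q against the sharpened target distribution p with a KL-divergence loss, as in DEC. All dimensions and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans

def dec_cluster(gap_features, k=20, epochs=100, embed_dim=10):
    """Cluster GAP features with a DEC-style loop (simplified sketch).

    gap_features: (N, D) array of GAP features for all salient patches.
    A linear map stands in for DEC's autoencoder embedding; k=20 matches
    the cluster count used in the reduction to practice below.
    """
    X = torch.as_tensor(gap_features, dtype=torch.float32)
    embed = nn.Linear(X.shape[1], embed_dim)
    with torch.no_grad():
        km = KMeans(n_clusters=k, n_init=10).fit(embed(X).numpy())
    centers = torch.tensor(km.cluster_centers_, dtype=torch.float32,
                           requires_grad=True)
    opt = torch.optim.SGD(list(embed.parameters()) + [centers], lr=0.01)
    for _ in range(epochs):
        q = 1.0 / (1.0 + torch.cdist(embed(X), centers) ** 2)  # Student's t
        q = q / q.sum(dim=1, keepdim=True)        # soft cluster assignments
        p = q ** 2 / q.sum(dim=0)                 # sharpened target (DEC)
        p = (p / p.sum(dim=1, keepdim=True)).detach()
        loss = F.kl_div(q.log(), p, reduction="batchmean")
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        members = torch.cdist(embed(X), centers).argmin(dim=1)
    return embed, centers.detach(), members
```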

(4.4) Bag of Key-Attributes Extraction

In the training phase, the outcome of the unsupervised deep embedding method is a mapping, f_α, that embeds the input GAP features into a discriminant subspace, together with the key attributes, μ_j for j=1, . . . , k. For a given input spectrogram, the system identifies the salient regions of the spectrogram, extracts GAP features from the M identified salient regions, v_i for i=1, . . . , M (M could vary for different input spectrograms), maps the features to the embedding via f_α, and obtains their cluster memberships. Using the cluster memberships, the system generates the histogram of key attributes present in a spectrogram, which encodes the normalized frequency of key-attribute occurrences. This histogram counts the occurrences of key attributes in an audio recording. For instance, the bag of key attributes (BoKA) feature for a playground scene would encode the existence and frequency of corresponding key sound attributes such as laughing and bird song.
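
Given the learned embedding and key attributes, the BoKA histogram is a normalized count of cluster memberships over a spectrogram's M salient patches. The sketch below assumes the `embed` and `centers` objects from the clustering sketch above as stand-ins for f_α and μ_j.

```python
import numpy as np
import torch

def bag_of_key_attributes(patch_gap_features, embed, centers, k=20):
    """Normalized histogram of key-attribute occurrences (sketch).

    patch_gap_features: (M, D) GAP features of the M salient patches of
    one spectrogram.  embed / centers come from the clustering sketch
    above and stand in for the learned mapping f_alpha and centers mu_j.
    """
    with torch.no_grad():
        Z = embed(torch.as_tensor(patch_gap_features, dtype=torch.float32))
        members = torch.cdist(Z, centers).argmin(dim=1).numpy()
    hist = np.bincount(members, minlength=k).astype(float)
    return hist / max(hist.sum(), 1.0)   # normalized occurrence frequencies
```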

In the test phase, for a given input spectrogram, its BoKA feature is calculated using the process above (salient region extraction, GAP feature extraction, and BoKA feature computation), but without relearning the unsupervised clustering. For classification, the resulting histogram is concatenated to the output of the CNN right before a Softmax classifier. In this manner, the network's extracted feature is enriched with an emphasis on the key learned attributes.

The Softmax layer of the CNN needs to be retrained to account for the BoKA feature. This retraining happens after computing the BoKA features of all training patterns and uses either the same training patterns or a subset for which classification labels are available.

The system schematic is shown in FIG. 6. As shown, a test input spectrogram 302 goes through the system, and the final classification probability 600 is obtained based on the concatenated 602 features obtained from the deep CNN 300, the salient patch extraction process 306, and Bag of Key-Attributes 304 extraction. Supplementing a pre-trained CNN with a histogram of the key attributes reduces the error on a sound classification task without requiring any additional data. The features are concatenated 602 by simply concatenating the vectors, which allows them to be classified using any suitable classifier 314 (e.g., Softmax classifier). For example, concatenating [1,0,3] and [4,5,6] becomes [1,0,3,4,5,6]. The classification provides a probability that the audio signal is a particular class. If the classification probability exceeds a predetermined threshold (e.g., 90%, or any other predetermined number), then the audio signal is classified as the particular object class. A device 800 can then be caused to operate based on the classification.
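
A minimal sketch of this final step: the CNN's output feature is concatenated with the BoKA histogram and mapped onto class probabilities by a Softmax classifier. The feature sizes, class count, and threshold below are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Sketch of the final classification step (sizes are illustrative).
cnn_feature = torch.randn(256)        # deep CNN output feature 310
boka = torch.rand(20)                 # Bag of Key Attributes histogram 304
combined = torch.cat([cnn_feature, boka])   # concatenation 602

softmax_classifier = nn.Sequential(nn.Linear(256 + 20, 10), nn.Softmax(dim=0))
probs = softmax_classifier(combined)  # classification probabilities 600
best = int(probs.argmax())
if probs[best] > 0.9:                 # predetermined threshold (e.g., 90%)
    print(f"classified as class {best}")
```

Consistent with the retraining note above, only the Softmax layer need be retrained on the concatenated features.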

(4.5) Control of a Device

As can be understood by those skilled in the art, there are a variety of applications in which such a robust and foolproof audio classification system can be implemented, such as autonomous driving and rotorcraft operation. As shown in FIG. 8, a processor 104 implementing the audio classification process may be used to control a device 800 (e.g., a mobile device display, a virtual reality display, an augmented reality display, a computer monitor, a motor, a machine, a drone, a camera, an autonomous vehicle, etc.) based on discriminating (i.e., classifying) the object. Thus, the device 800 may be controlled to cause the device to move or otherwise initiate a physical action based on the discrimination or classification of the object(s) generating the audio signals.

In some embodiments, a drone or other autonomous vehicle may be controlled to move based on the classification. For example, if implemented in an autonomous vehicle, the vehicle may be caused to maneuver based on a particular classification, non-limiting examples of which include driving away from “explosive” sounds, or applying brakes to slow down when a classification such as “children playing” is detected. In yet some other embodiments, a camera may be controlled to orient towards the source of the classified sound. In other words, actuators or motors are activated to cause the camera (or sensor) to move or zoom in on the location where the sought-after object is detected, such as “people talking”.

For surveillance, the audio classification system may trigger an alarm, which could either be an audible sound or initiate a computer program to execute further actions (e.g., turning on/off lights, gas (via electrically controlled gas valves), or electricity, etc.). Thus, and as can be appreciated, there are a number of devices that can be controlled based on classification of the object(s) generating the audio signals.

(4.6) Reduction to Practice

A key component of the system was tested to verify the classification superiority of the system as compared to the prior art. Specifically, key-attribute extraction was tested on an audio data set. The training set contained 8732 sound excerpts (<=4 s) distributed in 10 classes (e.g., UrbanSound8K—see Literature Reference No. 4). The audio samples were converted into spectrograms. First, a deep network learned the mapping from the spectrograms onto the class labels. A 10-layer network was used: two convolutional layers with 32 features followed by max pooling, two more layers with 64 features followed by max pooling, four more layers with 128 features followed by max pooling, and, finally, two fully connected layers mapping onto a Softmax classifier. On the input spectrogram, the kernel size was 60×3 with a stride length of 1, where 60 was the number of spectrogram frequencies and 3 the number of time steps of the kernel window. After each convolutional layer, an Exponential Linear Unit (ELU) was used as the nonlinearity.
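
A PyTorch sketch of this 10-layer network is given below. The 60×3 first-layer kernel collapses the frequency axis; the disclosure does not specify the kernel sizes of the later layers, the pooling windows, the fully connected width, or the input length, so those (1×3 kernels, 1×2 pooling, 256 hidden units, 176 time steps) are assumptions for illustration.

```python
import torch.nn as nn

def build_urbansound_cnn(num_classes=10, time_steps=176):
    """10-layer CNN sketch: 2 conv layers with 32 features + max pooling,
    2 with 64 + max pooling, 4 with 128 + max pooling, then 2 fully
    connected layers; ELU after each convolution, as described above."""
    def conv(cin, cout, kernel):
        # padding=(0, 1) keeps the time-axis length for 3-wide kernels
        return nn.Sequential(
            nn.Conv2d(cin, cout, kernel, stride=1, padding=(0, 1)), nn.ELU())
    return nn.Sequential(
        conv(1, 32, (60, 3)),   # 60x3 kernel spans all 60 frequencies
        conv(32, 32, (1, 3)), nn.MaxPool2d((1, 2)),
        conv(32, 64, (1, 3)), conv(64, 64, (1, 3)), nn.MaxPool2d((1, 2)),
        conv(64, 128, (1, 3)), conv(128, 128, (1, 3)),
        conv(128, 128, (1, 3)), conv(128, 128, (1, 3)), nn.MaxPool2d((1, 2)),
        nn.Flatten(),
        nn.Linear(128 * (time_steps // 8), 256), nn.ELU(),
        nn.Linear(256, num_classes),  # Softmax applied at inference/loss
    )
```

With these assumptions, `build_urbansound_cnn()(torch.randn(1, 1, 60, 176))` yields a (1, 10) tensor of class logits, to which a Softmax is applied for the 10 UrbanSound8K classes.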

As in Phase 1 above, the deep CNN was first used to extract salient patches. In total, 19,048 spectrogram patches were extracted. Then, GAP features were extracted, and the patches were clustered in an unsupervised way into 20 clusters. The patches closest to a cluster center were evaluated. As a result, and as shown in FIG. 7, example patches 700 were extracted from spectrograms that belong to sound classes 702 such as “birds”, “dog bark”, and “siren”. The system identified a variety of salient sound attributes, including attributes that were not labeled in the data set, such as “bird song”. Furthermore, the system was able to isolate fine-grained sound attributes, e.g., a police car siren versus an ambulance siren.

Finally, while this invention has been described in terms of several embodiments, one of ordinary skill in the art will readily recognize that the invention may have other applications in other environments. It should be noted that many embodiments and implementations are possible. Further, the following claims are in no way intended to limit the scope of the present invention to the specific embodiments described above. In addition, any recitation of “means for” is intended to evoke a means-plus-function reading of an element and a claim, whereas any elements that do not specifically use the recitation “means for” are not intended to be read as means-plus-function elements, even if the claim otherwise includes the word “means”. Further, while particular method steps have been recited in a particular order, the method steps may occur in any desired order and fall within the scope of the present invention.

What is claimed is:
1. An audio classification system, the system comprising: one or more processors and a memory, the memory being a non-transitory computer-readable medium having executable instructions encoded thereon, such that upon execution of the instructions, the one or more processors perform operations of: generating a set of activation patterns of a pre-trained convolutional neural network (CNN) comprising a plurality of layers by passing an input spectrogram of an audio signal to the CNN; applying non-negative matrix factorization (NMF) to the set of activation patterns; extracting a plurality of salient regions of the input spectrogram corresponding to salient audio segments in the audio signal; performing global average pooling (GAP) at each layer of the CNN; extracting a plurality of GAP features, where each GAP feature corresponds to a salient region of the input spectrogram of the audio signal; clustering the plurality of GAP features; generating a histogram of occurrences of key attributes in the audio signal; concatenating the histogram of occurrences of key attributes in the audio signal to extracted CNN features, whereby the extracted CNN features are enriched with an emphasis on the key attributes in the audio signal; mapping the concatenation of the histogram and extracted CNN features onto classification probabilities using a classifier; classifying the audio signal as an object class based on the classification probabilities; and causing a device to perform an operation based on the classification of the audio signal.
2. The system as set forth in claim 1, wherein the salient regions are extracted based on the CNN's activation for each spectrogram pixel or group of pixels in the input spectrogram.
3. The system as set forth in claim 1, wherein in clustering the plurality of GAP features, the GAP features are clustered using iterative unsupervised deep embedding for clustering.
4. The system as set forth in claim 3, wherein the device is an autonomous vehicle, such that the autonomous vehicle performs a physical maneuver operation based on the classification of the audio signal.
5. The system as set forth in claim 1, wherein the device is an autonomous vehicle, such that the autonomous vehicle performs a physical maneuver operation based on the classification of the audio signal.
6. A computer program product for audio classification, the computer program product comprising: a non-transitory computer-readable medium having executable instructions encoded thereon, such that upon execution of the instructions by one or more processors, the one or more processors perform operations of: generating a set of activation patterns of a pre-trained convolutional neural network (CNN) comprising a plurality of layers by passing an input spectrogram of an audio signal to the CNN; applying non-negative matrix factorization (NMF) to the set of activation patterns; extracting a plurality of salient regions of the input spectrogram corresponding to salient audio segments in the audio signal; performing global average pooling (GAP) at each layer of the CNN; extracting a plurality of GAP features, where each GAP feature corresponds to a salient region of the input spectrogram of the audio signal; clustering the plurality of GAP features; generating a histogram of occurrences of key attributes in the audio signal; concatenating the histogram of occurrences of key attributes in the audio signal to extracted CNN features, whereby the extracted CNN features are enriched with an emphasis on the key attributes in the audio signal; mapping the concatenation of the histogram and extracted CNN features onto classification probabilities using a classifier; classifying the audio signal as an object class based on the classification probabilities; and causing a device to perform an operation based on the classification of the audio signal.
7. The computer program product as set forth in claim 6, wherein the salient regions are extracted based on the CNN's activation for each spectrogram pixel or group of pixels in the input spectrogram.
8. The computer program product as set forth in claim 6, wherein in clustering the plurality of GAP features, the GAP features are clustered using iterative unsupervised deep embedding for clustering.
9. The computer program product as set forth in claim 8, wherein the device is an autonomous vehicle, such that the autonomous vehicle performs a physical maneuver operation based on the classification of the audio signal.
10. The computer program product as set forth in claim 6, wherein the device is an autonomous vehicle, such that the autonomous vehicle performs a physical maneuver operation based on the classification of the audio signal.
11. A computer implemented method for audio classification, the method comprising an act of: causing one or more processors to execute instructions encoded on a non-transitory computer-readable medium, such that upon execution, the one or more processors perform operations of: generating a set of activation patterns of a pre-trained convolutional neural network (CNN) comprising a plurality of layers by passing an input spectrogram of an audio signal to the CNN; applying non-negative matrix factorization (NMF) to the set of activation patterns; extracting a plurality of salient regions of the input spectrogram corresponding to salient audio segments in the audio signal; performing global average pooling (GAP) at each layer of the CNN; extracting a plurality of GAP features, where each GAP feature corresponds to a salient region of the input spectrogram of the audio signal; clustering the plurality of GAP features; generating a histogram of occurrences of key attributes in the audio signal; concatenating the histogram of occurrences of key attributes in the audio signal to extracted CNN features, whereby the extracted CNN features are enriched with an emphasis on the key attributes in the audio signal; mapping the concatenation of the histogram and extracted CNN features onto classification probabilities using a classifier; classifying the audio signal as an object class based on the classification probabilities; and causing a device to perform an operation based on the classification of the audio signal.
12. The method as set forth in claim 11, wherein the salient regions are extracted based on the CNN's activation for each spectrogram pixel or group of pixels in the input spectrogram.
13. The method as set forth in claim 11, wherein in clustering the plurality of GAP features, the GAP features are clustered using iterative unsupervised deep embedding for clustering.
14. The method as set forth in claim 13, wherein the device is an autonomous vehicle, such that the autonomous vehicle performs a physical maneuver operation based on the classification of the audio signal.
15. The method as set forth in claim 11, wherein the device is an autonomous vehicle, such that the autonomous vehicle performs a physical maneuver operation based on the classification of the audio signal.